{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# Google Gemini Vision in FiftyOne"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-1",
   "metadata": {},
   "source": [
    "The rapid advancement of multimodal AI models has opened new possibilities for computer vision workflows. Google's [Gemini Vision](https://ai.google.dev/gemini-api/docs/vision) models combine powerful visual understanding with natural language processing, enabling sophisticated image analysis, generation, and manipulation tasks.\n",
    "\n",
    "![editing_images](https://cdn.voxel51.com/tutorial_gemini_vision/editing_images.webp)\n",
    "\n",
    "The [Gemini Vision Plugin](https://docs.voxel51.com/plugins/plugins_ecosystem/gemini_vision_plugin.html) for FiftyOne brings these capabilities directly into your data-centric workflows, allowing you to leverage Gemini's vision-language models for dataset analysis, augmentation, and quality improvement.\n",
    "\n",
    "In this tutorial, we'll demonstrate how to use the Gemini Vision Plugin with FiftyOne to analyze a real-world autonomous driving dataset, identify dataset issues, and use Gemini's generative capabilities to improve data quality.\n",
    "\n",
    "Specifically, this walkthrough covers:\n",
    "\n",
    "* Installing and configuring the Gemini Vision Plugin for FiftyOne\n",
    "* Loading the KITTI autonomous driving dataset\n",
    "* Analyzing dataset quality and identifying biases using FiftyOne Brain\n",
    "* Using Gemini Vision to query and understand images\n",
    "* Detecting missing classes and annotation gaps\n",
    "* Generating new training images with text-to-image\n",
    "* Editing existing images to address dataset gaps\n",
    "* Transferring styles between images\n",
    "* Analyzing video content with Gemini's video understanding capabilities\n",
    "\n",
    "**So, what's the takeaway?**\n",
    "\n",
    "By combining FiftyOne's dataset analysis capabilities with Gemini Vision's multimodal AI features, you can build a powerful workflow for understanding, improving, and augmenting your computer vision datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-2",
   "metadata": {},
   "source": [
    "## What is Google Gemini Vision?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-3",
   "metadata": {},
   "source": [
    "[Google Gemini](https://deepmind.google/technologies/gemini/) is a family of multimodal AI models developed by Google DeepMind. Gemini Vision extends these models' capabilities to understand and generate visual content:\n",
    "\n",
    "* **Multimodal Understanding**: Process both images and text together for deep contextual understanding\n",
    "* **1M Token Context Window**: Analyze large amounts of visual and textual data in a single request (with Gemini 3.0)\n",
    "* **Image Generation**: Create new images from text descriptions\n",
    "* **Image Editing**: Modify existing images based on natural language instructions\n",
    "* **Video Understanding**: Analyze and query video content with temporal awareness\n",
    "* **Adjustable Reasoning**: Control the depth of analysis with configurable thinking levels\n",
    "\n",
    "The Gemini Vision Plugin makes these capabilities accessible directly within your FiftyOne workflows, enabling seamless integration of generative AI into your data preparation pipelines."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-4",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-5",
   "metadata": {},
   "source": [
    "To get started, you need to install [FiftyOne](https://docs.voxel51.com/getting_started/install.html) and the Gemini Vision Plugin:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-6",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install fiftyone"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-7",
   "metadata": {},
   "outputs": [],
   "source": [
    "!fiftyone plugins download https://github.com/AdonaiVera/gemini-vision-plugin"
   ]
  },
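  {
   "cell_type": "markdown",
   "id": "cell-7a",
   "metadata": {},
   "source": [
    "As a quick sanity check (a minimal sketch, not part of the plugin's own instructions), you can list the downloaded plugins and the operators they expose with the FiftyOne CLI. Exact operator names depend on the plugin version, so treat the names used later in this tutorial as search terms rather than guaranteed identifiers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List downloaded plugins; the Gemini Vision plugin should appear here\n",
    "!fiftyone plugins list\n",
    "\n",
    "# List the operators that the installed plugins expose\n",
    "!fiftyone operators list"
   ]
  },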
  {
   "cell_type": "markdown",
   "id": "cell-8",
   "metadata": {},
   "source": [
    "### Configure Gemini API Access"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "To use the Gemini Vision Plugin, you'll need a Google Cloud account with the Gemini API enabled. \n",
    "\n",
    "**Important**: The Gemini API requires billing to be enabled on your Google Cloud account. You can get started at [Google AI Studio](https://aistudio.google.com/app/apikey).\n",
    "\n",
    "Once you have your API key, set it as an environment variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Set your Gemini API key\n",
    "os.environ[\"GEMINI_API_KEY\"] = \"YOUR_GEMINI_API_KEY\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-11",
   "metadata": {},
   "source": [
    "Now import FiftyOne and related modules:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12",
   "metadata": {},
   "outputs": [],
   "source": [
    "import fiftyone as fo\n",
    "import fiftyone.zoo as foz\n",
    "import fiftyone.brain as fob\n",
    "from fiftyone import ViewField as F"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-13",
   "metadata": {},
   "source": [
    "## Load the KITTI Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-14",
   "metadata": {},
   "source": [
    "For this tutorial, we'll use the [KITTI Dataset](https://www.cvlibs.net/datasets/kitti/), a large-scale diverse driving dataset containing 7,481 annotated images, and the test split contains 7,518 unlabeled images, across various weather conditions, times of day, and scenes. \n",
    "\n",
    "KITTI is perfect for demonstrating Gemini Vision's capabilities because:\n",
    "* It contains diverse real-world scenarios\n",
    "* It has complex multi-object scenes\n",
    "* It's used for autonomous driving research, where dataset quality is critical\n",
    "* It may contain annotation biases and gaps that we can identify and address\n",
    "\n",
    "Downloading the dataset for the first time can take around 30 minutes, and for this tutorial, we'll use a subset of the training split. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-15",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = foz.load_zoo_dataset(\n",
    "    \"kitti\",\n",
    "    dataset_name = \"gemini-vision-tutorial\",\n",
    "    split=\"train\",\n",
    "    persistent=True,\n",
    "    max_samples=100,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-16",
   "metadata": {},
   "source": [
    "Let's visualize the dataset in the FiftyOne App:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-17",
   "metadata": {},
   "outputs": [],
   "source": [
    "session = fo.launch_app(dataset, port=5149)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53b1efb7",
   "metadata": {},
   "source": [
    "![intial_notebook](https://cdn.voxel51.com/tutorial_gemini_vision/intial_notebook.webp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-18",
   "metadata": {},
   "source": [
    "## Analyzing Dataset Quality with FiftyOne Brain"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "Before we start using Gemini Vision, let's analyze our dataset to understand its characteristics and identify potential issues. FiftyOne Brain provides powerful capabilities for dataset analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-20",
   "metadata": {},
   "source": [
    "### Identifying Class Imbalance and Bias"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-21",
   "metadata": {},
   "source": [
    "First, let’s examine the distribution of object classes in our dataset to identify any biases or underrepresented categories. You can also explore these insights with the [Dashboard plugin](https://docs.voxel51.com/plugins/plugins_ecosystem/dashboard.html), which lets you build custom dashboards to visualize key statistics about your dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "cell-22",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Object Class Distribution:\n",
      "==================================================\n",
      "Car: 28742\n",
      "DontCare: 11295\n",
      "Pedestrian: 4487\n",
      "Van: 2914\n",
      "Cyclist: 1627\n",
      "Truck: 1094\n",
      "Misc: 973\n",
      "Tram: 511\n",
      "Person_sitting: 222\n"
     ]
    }
   ],
   "source": [
    "# Count the distribution of object classes in detections\n",
    "class_counts = dataset.count_values(\"ground_truth.detections.label\")\n",
    "\n",
    "# Display class distribution\n",
    "print(\"Object Class Distribution:\")\n",
    "print(\"=\" * 50)\n",
    "for cls, count in sorted(class_counts.items(), key=lambda x: x[1], reverse=True):\n",
    "    print(f\"{cls}: {count}\")"
   ]
  },
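  {
   "cell_type": "markdown",
   "id": "cell-22a",
   "metadata": {},
   "source": [
    "Raw counts can be hard to compare at a glance. The sketch below assumes the `class_counts` dictionary from the previous cell and prints each class's share of all annotated objects, plus a simple max/min imbalance ratio. The ratio is just one illustrative summary, not a metric defined by FiftyOne:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-22b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumes `class_counts` from the previous cell\n",
    "total = sum(class_counts.values())\n",
    "\n",
    "# Print class name, count, and share of all annotated objects\n",
    "for cls, count in sorted(class_counts.items(), key=lambda x: x[1], reverse=True):\n",
    "    print(f\"{str(cls):<16}{count:>8}{count / total:>9.1%}\")\n",
    "\n",
    "# Ratio between the most and least frequent classes (illustrative summary)\n",
    "if class_counts:\n",
    "    ratio = max(class_counts.values()) / min(class_counts.values())\n",
    "    print(f\"\\nImbalance ratio (max/min): {ratio:.1f}\")"
   ]
  },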
  {
   "cell_type": "markdown",
   "id": "cell-25",
   "metadata": {},
   "source": [
    "### Computing Dataset Uniqueness"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-26",
   "metadata": {},
   "source": [
    "Next, let's use FiftyOne Brain to identify unique and potentially redundant samples in our dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute uniqueness scores\n",
    "fob.compute_uniqueness(dataset)\n",
    "\n",
    "# Sort by uniqueness to see most and least unique samples\n",
    "unique_view = dataset.sort_by(\"uniqueness\", reverse=True)\n",
    "\n",
    "print(f\"Most unique samples (uniqueness > 0.9): {len(dataset.match(F('uniqueness') > 0.9))}\")\n",
    "print(f\"Potentially redundant samples (uniqueness < 0.3): {len(dataset.match(F('uniqueness') < 0.3))}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ead22ffc",
   "metadata": {},
   "source": [
    "![compute_uniqueness](https://cdn.voxel51.com/tutorial_gemini_vision/compute_uniqueness.webp)"
   ]
  },
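  {
   "cell_type": "markdown",
   "id": "cell-27a",
   "metadata": {},
   "source": [
    "The 0.9 and 0.3 cutoffs used above are arbitrary. As a rough sketch for choosing thresholds, you can inspect how the scores are actually distributed with FiftyOne's `bounds()` and `histogram_values()` aggregations. The bin count of 10 is an assumption made for illustration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-27b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Range of uniqueness scores in this dataset\n",
    "lo, hi = dataset.bounds(\"uniqueness\")\n",
    "print(f\"Uniqueness range: {lo:.3f} to {hi:.3f}\")\n",
    "\n",
    "# Text histogram of the scores (10 bins is an arbitrary choice)\n",
    "counts, edges, _ = dataset.histogram_values(\"uniqueness\", bins=10)\n",
    "for left, right, count in zip(edges[:-1], edges[1:], counts):\n",
    "    print(f\"{left:.2f}-{right:.2f}: {'#' * count}\")"
   ]
  },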
  {
   "cell_type": "markdown",
   "id": "cell-28",
   "metadata": {},
   "source": [
    "### Detecting Near-Duplicate Images"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-29",
   "metadata": {},
   "source": [
    "Duplicate or near-duplicate images can inflate evaluation metrics and waste training time. Let's find them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-30",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Detect near-duplicate images\n",
    "results = fob.compute_similarity(\n",
    "    dataset,\n",
    "    model=\"clip-vit-base32-torch\",\n",
    "    brain_key=\"img_sim\"\n",
    ")\n",
    "\n",
    "# Find potential duplicates\n",
    "dup_view = dataset.sort_by(\"uniqueness\").limit(20)\n",
    "session.view = dup_view"
   ]
  },
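  {
   "cell_type": "markdown",
   "id": "cell-30a",
   "metadata": {},
   "source": [
    "Sorting by uniqueness only surfaces candidates. The similarity index stored under the `img_sim` brain key can also flag near-duplicates directly via `find_duplicates()`. The sketch below reloads the index by its brain key; the distance threshold of 0.1 is an assumption that you will likely need to tune for your data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-30b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reload the similarity index stored under the brain key used above\n",
    "index = dataset.load_brain_results(\"img_sim\")\n",
    "\n",
    "# Flag samples that lie within the distance threshold of another sample\n",
    "# NOTE: thresh=0.1 is an illustrative value, not a recommended default\n",
    "index.find_duplicates(thresh=0.1)\n",
    "print(f\"Near-duplicate samples found: {len(index.duplicate_ids)}\")\n",
    "\n",
    "# View the flagged duplicates alongside the samples they resemble\n",
    "session.view = index.duplicates_view()"
   ]
  },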
  {
   "cell_type": "markdown",
   "id": "cell-31",
   "metadata": {},
   "source": [
    "### Visualizing Dataset Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-32",
   "metadata": {},
   "source": [
    "Let's compute embeddings and visualize the dataset structure to identify clusters and potential gaps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0135f1d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install umap-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute visualization with embeddings\n",
    "results = fob.compute_visualization(\n",
    "    dataset,\n",
    "    model=\"clip-vit-base32-torch\",\n",
    "    brain_key=\"img_viz\",\n",
    "    method=\"umap\"\n",
    ")\n",
    "\n",
    "# Launch app to view the embeddings visualization\n",
    "session = fo.launch_app(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e047bbbe",
   "metadata": {},
   "source": [
    "![umap](https://cdn.voxel51.com/tutorial_gemini_vision/umap.webp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-34",
   "metadata": {},
   "source": [
    "The embeddings plot in the FiftyOne App reveals clustering patterns in the data. Isolated samples or sparse regions may indicate underrepresented scenarios that need more data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-35",
   "metadata": {},
   "source": [
    "## Using Gemini Vision for Image Understanding"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-36",
   "metadata": {},
   "source": [
    "Now let's use the Gemini Vision Plugin to query and understand images in our dataset. The plugin provides several operators that can be accessed through the FiftyOne App or programmatically.\n",
    "\n",
    "### Querying Images with Natural Language"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-37",
   "metadata": {},
   "source": [
    "Let's select a few samples and use Gemini to analyze them. First, we'll select some samples with specific objects:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-38",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select samples with cars\n",
    "car_view = dataset.filter_labels(\"ground_truth\", F(\"label\") == \"Car\").take(5)\n",
    "\n",
    "# View in the app\n",
    "session.view = car_view"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f41411d",
   "metadata": {},
   "source": [
    "![query_images](https://cdn.voxel51.com/tutorial_gemini_vision/query_images.webp)|"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-39",
   "metadata": {},
   "source": [
    "Now, you can use the Gemini Vision Plugin operators from the FiftyOne App:\n",
    "\n",
    "1. Select one or more samples in the App\n",
    "2. Press the backtick key (\\`) to open the operator browser\n",
    "3. Search for \"query_gemini_vision\" or \"Query Gemini Vision\"\n",
    "4. Enter your query, for example:\n",
    "   - \"Describe the weather and lighting conditions in this image\"\n",
    "   - \"What time of day does this appear to be?\"\n",
    "   - \"Are there any pedestrians or cyclists visible?\"\n",
    "   - \"Describe potential safety hazards in this driving scene\"\n",
    "\n",
    "The plugin will use Gemini Vision to analyze the image and return a text response, which you can save to a custom field in your dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-40",
   "metadata": {},
   "source": [
    "### Identifying Missing Annotations"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-41",
   "metadata": {},
   "source": [
    "One powerful use case for Gemini Vision is identifying objects that may be missing from annotations. Let's use it to audit our annotations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-42",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select a sample to analyze\n",
    "sample = dataset.first()\n",
    "\n",
    "# Get the list of currently annotated classes\n",
    "annotated_classes = [det.label for det in sample.ground_truth.detections]\n",
    "\n",
    "print(f\"Currently annotated classes: {set(annotated_classes)}\")\n",
    "print(\"\\nUse the Gemini Vision Query operator in the App to ask:\")\n",
    "print(\"'List all objects visible in this image, especially those that might be relevant for autonomous driving.'\")\n",
    "print(\"\\nCompare Gemini's response with the annotations to find missing objects.\")\n",
    "\n",
    "view = dataset.select([sample.id])\n",
    "session = fo.launch_app(view)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aaa5212e",
   "metadata": {},
   "source": [
    "![missing_annotations](https://cdn.voxel51.com/tutorial_gemini_vision/missing_annotations.webp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-43",
   "metadata": {},
   "source": [
    "### Analyzing Difficult or Ambiguous Cases"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-44",
   "metadata": {},
   "source": [
    "Let's identify samples with many objects that might be challenging to annotate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-45",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find samples with many objects\n",
    "dataset.compute_metadata()\n",
    "\n",
    "# Count detections per sample\n",
    "for sample in dataset:\n",
    "    if sample.ground_truth:\n",
    "        sample[\"num_objects\"] = len(sample.ground_truth.detections)\n",
    "    else:\n",
    "        sample[\"num_objects\"] = 0\n",
    "    sample.save()\n",
    "\n",
    "# View samples with the most objects\n",
    "complex_view = dataset.sort_by(\"num_objects\", reverse=True).limit(10)\n",
    "session.view = complex_view\n",
    "\n",
    "print(\"These complex scenes are good candidates for Gemini Vision analysis to:\")\n",
    "print(\"- Verify annotation completeness\")\n",
    "print(\"- Identify occlusions and difficult objects\")\n",
    "print(\"- Understand scene context and relationships between objects\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6645b5c3",
   "metadata": {},
   "source": [
    "![complex_scene](https://cdn.voxel51.com/tutorial_gemini_vision/complex_scene.webp)"
   ]
  },
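  {
   "cell_type": "markdown",
   "id": "cell-45a",
   "metadata": {},
   "source": [
    "If you don't need to store the count as a field, a view expression gives the same ordering without looping over samples. This is an alternative sketch, not a replacement for the cell above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-45b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort by the number of ground truth detections using a view expression\n",
    "complex_view = dataset.sort_by(\n",
    "    F(\"ground_truth.detections\").length(), reverse=True\n",
    ").limit(10)\n",
    "session.view = complex_view"
   ]
  },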
  {
   "cell_type": "markdown",
   "id": "cell-46",
   "metadata": {},
   "source": [
    "## Detecting Missing Classes and Coverage Gaps"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-47",
   "metadata": {},
   "source": [
    "Based on our class distribution analysis, we may have identified underrepresented object classes. Let's systematically find which classes are missing or underrepresented:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "cell-48",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total classes in dataset: 9\n",
      "\n",
      "Rare classes (< 5 instances):\n",
      "\n",
      "Most common classes:\n",
      "  - Car: 28742 instances\n",
      "  - DontCare: 11295 instances\n",
      "  - Pedestrian: 4487 instances\n",
      "  - Van: 2914 instances\n",
      "  - Cyclist: 1627 instances\n"
     ]
    }
   ],
   "source": [
    "# Get all classes present in the dataset\n",
    "class_counts = dataset.count_values(\"ground_truth.detections.label\")\n",
    "all_classes = sorted(class_counts.keys())\n",
    "\n",
    "# Define a threshold for rare classes\n",
    "threshold = 5  # Consider classes with fewer than 5 instances as rare\n",
    "\n",
    "# Identify rare classes\n",
    "rare_classes = []\n",
    "for cls, count in class_counts.items():\n",
    "    if count < threshold:\n",
    "        rare_classes.append((cls, count))\n",
    "\n",
    "print(f\"Total classes in dataset: {len(all_classes)}\")\n",
    "print(f\"\\nRare classes (< {threshold} instances):\")\n",
    "for cls, count in sorted(rare_classes, key=lambda x: x[1]):\n",
    "    print(f\"  - {cls}: {count} instances\")\n",
    "\n",
    "print(\"\\nMost common classes:\")\n",
    "for cls, count in sorted(class_counts.items(), key=lambda x: x[1], reverse=True)[:5]:\n",
    "    print(f\"  - {cls}: {count} instances\")"
   ]
  },
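  {
   "cell_type": "markdown",
   "id": "cell-48a",
   "metadata": {},
   "source": [
    "Whether or not any class falls below the threshold, it can help to look at the least frequent classes directly. The sketch below assumes `class_counts` from the previous cell, picks the three rarest labels (the choice of three is arbitrary), and shows only their detections in the App:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-48b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Three least frequent classes (the number 3 is an arbitrary choice)\n",
    "least_common = sorted(class_counts, key=class_counts.get)[:3]\n",
    "print(f\"Least common classes: {least_common}\")\n",
    "\n",
    "# Keep only the detections that belong to those classes\n",
    "rare_view = dataset.filter_labels(\n",
    "    \"ground_truth\", F(\"label\").is_in(least_common)\n",
    ")\n",
    "session.view = rare_view"
   ]
  },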
  {
   "cell_type": "markdown",
   "id": "cell-49",
   "metadata": {},
   "source": [
    "### Identifying Scenario Coverage Gaps"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-50",
   "metadata": {},
   "source": [
    "Beyond object classes, we should also consider scenario diversity. Let's use Gemini Vision to categorize our images by scenario characteristics:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "cell-51",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Use the Gemini Vision Query operator to categorize these samples by:\n",
      "\n",
      "1. Weather conditions:\n",
      "   Query: 'What are the weather conditions? Classify as: clear, rainy, foggy, snowy, or cloudy.'\n",
      "\n",
      "2. Time of day:\n",
      "   Query: 'What time of day is this? Classify as: dawn, day, dusk, or night.'\n",
      "\n",
      "3. Scene type:\n",
      "   Query: 'What type of driving environment is this? Classify as: highway, urban, residential, or rural.'\n",
      "\n",
      "Save responses to custom fields like 'gemini_weather', 'gemini_time', 'gemini_scene'\n",
      "Then use FiftyOne's count_values() to identify underrepresented scenarios.\n"
     ]
    }
   ],
   "source": [
    "# Sample a subset for scenario analysis\n",
    "analysis_view = dataset.take(50)\n",
    "\n",
    "print(\"Use the Gemini Vision Query operator to categorize these samples by:\")\n",
    "print(\"\\n1. Weather conditions:\")\n",
    "print(\"   Query: 'What are the weather conditions? Classify as: clear, rainy, foggy, snowy, or cloudy.'\")\n",
    "print(\"\\n2. Time of day:\")\n",
    "print(\"   Query: 'What time of day is this? Classify as: dawn, day, dusk, or night.'\")\n",
    "print(\"\\n3. Scene type:\")\n",
    "print(\"   Query: 'What type of driving environment is this? Classify as: highway, urban, residential, or rural.'\")\n",
    "print(\"\\nSave responses to custom fields like 'gemini_weather', 'gemini_time', 'gemini_scene'\")\n",
    "print(\"Then use FiftyOne's count_values() to identify underrepresented scenarios.\")"
   ]
  },
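  {
   "cell_type": "markdown",
   "id": "cell-51a",
   "metadata": {},
   "source": [
    "Once responses have been saved, the tallying step might look like the sketch below. It assumes you stored the answers in the `gemini_weather`, `gemini_time`, and `gemini_scene` fields suggested above; fields that don't exist yet are skipped. Free-text answers may need normalizing before the counts are meaningful:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-51b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Field names follow the suggestion above; adjust them to whatever you used\n",
    "scenario_fields = [\"gemini_weather\", \"gemini_time\", \"gemini_scene\"]\n",
    "\n",
    "for field in scenario_fields:\n",
    "    # Skip fields that have not been created by a Gemini query yet\n",
    "    if not dataset.has_sample_field(field):\n",
    "        print(f\"{field}: not populated yet\")\n",
    "        continue\n",
    "\n",
    "    # Tally how often each response value occurs\n",
    "    print(f\"\\n{field}:\")\n",
    "    for value, count in dataset.count_values(field).items():\n",
    "        print(f\"  {value}: {count}\")"
   ]
  },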
  {
   "cell_type": "markdown",
   "id": "cell-52",
   "metadata": {},
   "source": [
    "## Addressing Dataset Gaps with Image Generation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-53",
   "metadata": {},
   "source": [
    "Now that we've identified missing classes and underrepresented scenarios, let's use Gemini's text-to-image generation capabilities to create synthetic training data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-54",
   "metadata": {},
   "source": [
    "### Generating Images for Missing Classes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-55",
   "metadata": {},
   "source": [
    "The Gemini Vision Plugin includes a text-to-image generation operator. You can use it from the FiftyOne App:\n",
    "\n",
    "1. Open the operator browser (backtick key)\n",
    "2. Search for \"generate_image\" or \"Generate Image\"\n",
    "3. Enter prompts for missing or rare classes:\n",
    "\n",
    "**Example prompts for autonomous driving scenarios:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "cell-56",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Use these prompts with the Gemini Generate Image operator:\n",
      "\n",
      "fire_hydrant:\n",
      "  A city street scene with a fire hydrant in the foreground, cars parked on the side, shot from a dashboard camera perspective\n",
      "\n",
      "motorcycle:\n",
      "  A motorcyclist riding on a highway during sunset, viewed from a car's perspective\n",
      "\n",
      "cyclist_rain:\n",
      "  A residential street with a cyclist and a stop sign, rainy weather, dashboard camera view\n",
      "\n",
      "night_traffic:\n",
      "  A busy urban intersection at night with traffic lights, pedestrians crossing, and various vehicles\n",
      "\n",
      "foggy_highway:\n",
      "  A foggy morning highway scene with trucks and cars, limited visibility\n"
     ]
    }
   ],
   "source": [
    "# Example: Define prompts for missing scenarios programmatically\n",
    "generation_prompts = {\n",
    "    \"fire_hydrant\": \"A city street scene with a fire hydrant in the foreground, cars parked on the side, shot from a dashboard camera perspective\",\n",
    "    \"motorcycle\": \"A motorcyclist riding on a highway during sunset, viewed from a car's perspective\",\n",
    "    \"cyclist_rain\": \"A residential street with a cyclist and a stop sign, rainy weather, dashboard camera view\",\n",
    "    \"night_traffic\": \"A busy urban intersection at night with traffic lights, pedestrians crossing, and various vehicles\",\n",
    "    \"foggy_highway\": \"A foggy morning highway scene with trucks and cars, limited visibility\"\n",
    "}\n",
    "\n",
    "print(\"Use these prompts with the Gemini Generate Image operator:\")\n",
    "for name, prompt in generation_prompts.items():\n",
    "    print(f\"\\n{name}:\")\n",
    "    print(f\"  {prompt}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03557409",
   "metadata": {},
   "source": [
    "![generate_images](https://cdn.voxel51.com/tutorial_gemini_vision/generate_images.webp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-60",
   "metadata": {},
   "source": [
    "## Editing Images to Augment Dataset Diversity"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-61",
   "metadata": {},
   "source": [
    "In addition to generating new images, Gemini Vision can edit existing images based on natural language instructions. This is useful for creating variations and augmenting dataset diversity."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-62",
   "metadata": {},
   "source": [
    "### Using the Image Editing Operator"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-63",
   "metadata": {},
   "source": [
    "To edit images with Gemini Vision:\n",
    "\n",
    "1. Select a single sample in the FiftyOne App\n",
    "2. Open the operator browser (backtick key)\n",
    "3. Search for \"edit_image\" or \"Edit Image\"\n",
    "4. Enter editing instructions\n",
    "\n",
    "**Example editing prompts:**\n",
    "The edited image will be saved with the original prompt preserved in metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-64",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select samples with clear weather for editing\n",
    "clear_weather_view = dataset.take(10)\n",
    "session.view = clear_weather_view\n",
    "\n",
    "print(\"Selected samples for editing. Use the Edit Image operator with prompts like:\")\n",
    "print(\"  - 'Change the weather to rainy, add rain and wet roads'\")\n",
    "print(\"  - 'Make it nighttime with street lights illuminated'\")\n",
    "print(\"  - 'Add fog to reduce visibility'\")\n",
    "print(\"\\nThis creates weather and lighting variations to improve model robustness.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "055f958a",
   "metadata": {},
   "source": [
    "![editing_images](https://cdn.voxel51.com/tutorial_gemini_vision/editing_images.webp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-65",
   "metadata": {},
   "source": [
    "## Transferring Styles Between Images"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-66",
   "metadata": {},
   "source": [
    "Gemini Vision can combine multiple images to create new scenes or transfer styles. This is useful for:\n",
    "* Transferring weather conditions from one image to another\n",
    "* Combining objects from different scenes\n",
    "* Creating composite training examples"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-67",
   "metadata": {},
   "source": [
    "### Using Multi-Image Composition"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-68",
   "metadata": {},
   "source": [
    "To use multi-image composition:\n",
    "\n",
    "1. Select 2-3 samples in the FiftyOne App\n",
    "2. Open the operator browser\n",
    "3. Search for \"compose_images\" or \"Multi-Image Composition\"\n",
    "4. Enter composition instructions\n",
    "\n",
    "**Example composition prompts:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "cell-69",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Multi-Image Composition Workflow:\n",
      "\n",
      "1. Find an image with desired weather/lighting conditions\n",
      "2. Find an image with desired scene/object composition\n",
      "3. Select both images\n",
      "4. Use Multi-Image Composition operator to combine them\n",
      "\n",
      "Example: Transfer nighttime lighting to a daytime scene,\n",
      "creating a diverse set of lighting conditions for training.\n"
     ]
    }
   ],
   "source": [
    "# This example shows the concept - actual execution is done through the App\n",
    "print(\"Multi-Image Composition Workflow:\")\n",
    "print(\"\\n1. Find an image with desired weather/lighting conditions\")\n",
    "print(\"2. Find an image with desired scene/object composition\")\n",
    "print(\"3. Select both images\")\n",
    "print(\"4. Use Multi-Image Composition operator to combine them\")\n",
    "print(\"\\nExample: Transfer nighttime lighting to a daytime scene,\")\n",
    "print(\"creating a diverse set of lighting conditions for training.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d716f43",
   "metadata": {},
   "source": [
    "![multi_image_composition](https://cdn.voxel51.com/tutorial_gemini_vision/multi_image_composition.webp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-70",
   "metadata": {},
   "source": [
    "## Video Understanding with Gemini Vision"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-71",
   "metadata": {},
   "source": [
    "Gemini Vision also supports video understanding, allowing you to analyze temporal sequences and extract insights from video data. This is particularly relevant for autonomous driving where temporal context matters."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-72",
   "metadata": {},
   "source": [
    "### Loading Video Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-73",
   "metadata": {},
   "source": [
    "Let's load a video dataset to demonstrate Gemini's video understanding capabilities:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-74",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the quickstart-video dataset\n",
    "video_dataset = foz.load_zoo_dataset(\n",
    "    \"quickstart-video\",\n",
    "    max_samples=5\n",
    ")\n",
    "\n",
    "video_dataset.name = \"gemini-vision-video\"\n",
    "video_dataset.persistent = True\n",
    "\n",
    "# Launch app to view videos\n",
    "session = fo.launch_app(video_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-75",
   "metadata": {},
   "source": [
    "### Querying Video Content"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-76",
   "metadata": {},
   "source": [
    "The Gemini Vision Plugin includes a video understanding operator with multiple modes:\n",
    "\n",
    "1. **Describe**: Get a detailed description of the video content\n",
    "2. **Segment**: Identify temporal segments with different characteristics\n",
    "3. **Extract**: Extract specific information (objects, actions, events)\n",
    "4. **Question**: Ask specific questions about the video content\n",
    "\n",
    "To use video understanding:\n",
    "\n",
    "1. Select a video sample in the FiftyOne App\n",
    "2. Open the operator browser\n",
    "3. Search for \"analyze_video\" or \"Video Understanding\"\n",
    "4. Select the mode and enter your query\n",
    "\n",
    "**Example video queries:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "cell-77",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Video Understanding Queries:\n",
      "\n",
      "1. Mode: describe\n",
      "   Query: Provide a detailed description of this driving video including weather, traffic conditions, and notable events\n",
      "\n",
      "2. Mode: extract\n",
      "   Query: List all vehicle types that appear in this video with their approximate timestamps\n",
      "\n",
      "3. Mode: question\n",
      "   Query: Are there any potentially dangerous situations in this driving video?\n",
      "\n",
      "4. Mode: segment\n",
      "   Query: Segment this video by traffic density (low, medium, high)\n"
     ]
    }
   ],
   "source": [
    "# Example queries for video analysis\n",
    "video_queries = [\n",
    "    {\n",
    "        \"mode\": \"describe\",\n",
    "        \"query\": \"Provide a detailed description of this driving video including weather, traffic conditions, and notable events\"\n",
    "    },\n",
    "    {\n",
    "        \"mode\": \"extract\",\n",
    "        \"query\": \"List all vehicle types that appear in this video with their approximate timestamps\"\n",
    "    },\n",
    "    {\n",
    "        \"mode\": \"question\",\n",
    "        \"query\": \"Are there any potentially dangerous situations in this driving video?\"\n",
    "    },\n",
    "    {\n",
    "        \"mode\": \"segment\",\n",
    "        \"query\": \"Segment this video by traffic density (low, medium, high)\"\n",
    "    }\n",
    "]\n",
    "\n",
    "print(\"Video Understanding Queries:\")\n",
    "for i, q in enumerate(video_queries, 1):\n",
    "    print(f\"\\n{i}. Mode: {q['mode']}\")\n",
    "    print(f\"   Query: {q['query']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23652409",
   "metadata": {},
   "source": [
    "![video_understanding](https://cdn.voxel51.com/tutorial_gemini_vision/video_understanding.webp)\n",
    "\n",
    "Now it's your turn: keep exploring Gemini Vision to pull more insights from the video. The next sections suggest two directions, analyzing temporal patterns and extracting temporal annotations."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-78",
   "metadata": {},
   "source": [
    "### Analyzing Temporal Patterns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-79",
   "metadata": {},
   "source": [
    "Video understanding lets you identify temporal patterns that aren't visible in individual frames. The example queries below cover common driving use cases:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "cell-80",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Temporal Analysis Use Cases:\n",
      "\n",
      "1. Lane Changes:\n",
      "   Query: 'Identify all lane change maneuvers and their timestamps'\n",
      "\n",
      "2. Traffic Signal Compliance:\n",
      "   Query: 'Does the vehicle stop at all red lights? Provide timestamps.'\n",
      "\n",
      "3. Pedestrian Interactions:\n",
      "   Query: 'Identify all moments when pedestrians cross the road'\n",
      "\n",
      "4. Weather Changes:\n",
      "   Query: 'Does the weather change during this video? If so, when?'\n",
      "\n",
      "5. Scene Transitions:\n",
      "   Query: 'Segment this video by scene type (highway, urban, residential)'\n"
     ]
    }
   ],
   "source": [
    "# Example queries that rely on temporal context, keyed by use case\n",
    "temporal_use_cases = {\n",
    "    \"Lane Changes\": \"Identify all lane change maneuvers and their timestamps\",\n",
    "    \"Traffic Signal Compliance\": \"Does the vehicle stop at all red lights? Provide timestamps.\",\n",
    "    \"Pedestrian Interactions\": \"Identify all moments when pedestrians cross the road\",\n",
    "    \"Weather Changes\": \"Does the weather change during this video? If so, when?\",\n",
    "    \"Scene Transitions\": \"Segment this video by scene type (highway, urban, residential)\",\n",
    "}\n",
    "\n",
    "print(\"Temporal Analysis Use Cases:\")\n",
    "for i, (name, query) in enumerate(temporal_use_cases.items(), 1):\n",
    "    print(f\"\\n{i}. {name}:\")\n",
    "    print(f\"   Query: '{query}'\")"
   ]
  },
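  {
   "cell_type": "markdown",
   "id": "cell-80a",
   "metadata": {},
   "source": [
    "Responses to queries like these may mention timestamps in free text. As a minimal sketch, the snippet below pulls `MM:SS` timestamps out of a response string and converts them to seconds. The `response_text` value and the `MM:SS` format are assumptions made for illustration; the wording and timestamp format that Gemini actually returns may differ. `extract_timestamps` is a hypothetical helper, not part of the plugin."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-80b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical response text; the wording and timestamp format that\n",
    "# Gemini actually returns may differ\n",
    "response_text = (\n",
    "    \"The vehicle changes lanes at 00:12 and again at 01:05. \"\n",
    "    \"A third lane change occurs at 02:30.\"\n",
    ")\n",
    "\n",
    "# Matches MM:SS timestamps such as 00:12 or 1:05\n",
    "TIMESTAMP_PATTERN = re.compile(r\"\\b(\\d{1,2}):([0-5]\\d)\\b\")\n",
    "\n",
    "\n",
    "def extract_timestamps(text):\n",
    "    \"\"\"Return all MM:SS timestamps in `text` as seconds, in order of appearance.\"\"\"\n",
    "    return [\n",
    "        int(minutes) * 60 + int(seconds)\n",
    "        for minutes, seconds in TIMESTAMP_PATTERN.findall(text)\n",
    "    ]\n",
    "\n",
    "\n",
    "event_seconds = extract_timestamps(response_text)\n",
    "print(event_seconds)"
   ]
  },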
  {
   "cell_type": "markdown",
   "id": "cell-81",
   "metadata": {},
   "source": [
    "### Extracting Temporal Annotations"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-82",
   "metadata": {},
   "source": [
    "The responses from video understanding can be used to create temporal annotations in FiftyOne:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "cell-83",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Workflow for creating temporal annotations from Gemini responses:\n",
      "\n",
      "1. Use Video Understanding operator to segment video\n",
      "2. Gemini returns timestamps for each segment\n",
      "3. Convert timestamps to frame numbers\n",
      "4. Tag frames with segment characteristics\n",
      "\n",
      "Example: Tag frames as 'high_traffic', 'medium_traffic', 'low_traffic'\n",
      "based on Gemini's temporal segmentation\n"
     ]
    }
   ],
   "source": [
    "# Example: After using Gemini to segment a video, you can create frame-level tags\n",
    "# This is a conceptual example - actual implementation would parse Gemini's response\n",
    "\n",
    "print(\"Workflow for creating temporal annotations from Gemini responses:\")\n",
    "print(\"\\n1. Use Video Understanding operator to segment video\")\n",
    "print(\"2. Gemini returns timestamps for each segment\")\n",
    "print(\"3. Convert timestamps to frame numbers\")\n",
    "print(\"4. Tag frames with segment characteristics\")\n",
    "print(\"\\nExample: Tag frames as 'high_traffic', 'medium_traffic', 'low_traffic'\")\n",
    "print(\"based on Gemini's temporal segmentation\")"
   ]
  },
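  {
   "cell_type": "markdown",
   "id": "cell-83a",
   "metadata": {},
   "source": [
    "The timestamp-to-frame conversion in step 3 can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: the `segments` list is a hypothetical, already-parsed response, and `to_seconds`, `to_frame_number`, and `segments_to_supports` are helper names introduced here. The frame rate and frame count are example values; for a real sample they are available as `sample.metadata.frame_rate` and `sample.metadata.total_frame_count` after calling `video_dataset.compute_metadata()`. Each resulting `[first_frame, last_frame]` pair can then be stored as the `support` of a `fo.TemporalDetection`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-83b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical segments, as they might look after parsing a \"segment\" mode\n",
    "# response; the real response format may differ\n",
    "segments = [\n",
    "    {\"label\": \"low_traffic\", \"start\": \"00:00\", \"end\": \"00:04\"},\n",
    "    {\"label\": \"high_traffic\", \"start\": \"00:04\", \"end\": \"00:10\"},\n",
    "]\n",
    "\n",
    "\n",
    "def to_seconds(timestamp):\n",
    "    \"\"\"Convert an MM:SS string to seconds.\"\"\"\n",
    "    minutes, seconds = timestamp.split(\":\")\n",
    "    return int(minutes) * 60 + float(seconds)\n",
    "\n",
    "\n",
    "def to_frame_number(seconds, frame_rate, total_frames):\n",
    "    \"\"\"Map a time offset to a 1-based frame number, clamped to the video length.\"\"\"\n",
    "    frame_number = int(seconds * frame_rate) + 1\n",
    "    return max(1, min(frame_number, total_frames))\n",
    "\n",
    "\n",
    "def segments_to_supports(segments, frame_rate, total_frames):\n",
    "    \"\"\"Return (label, [first_frame, last_frame]) pairs, one per segment.\"\"\"\n",
    "    supports = []\n",
    "    for segment in segments:\n",
    "        first = to_frame_number(to_seconds(segment[\"start\"]), frame_rate, total_frames)\n",
    "        last = to_frame_number(to_seconds(segment[\"end\"]), frame_rate, total_frames)\n",
    "        supports.append((segment[\"label\"], [first, last]))\n",
    "    return supports\n",
    "\n",
    "\n",
    "# Example values; in practice, read these from sample.metadata\n",
    "supports = segments_to_supports(segments, frame_rate=30.0, total_frames=300)\n",
    "print(supports)"
   ]
  },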
  {
   "cell_type": "markdown",
   "id": "cell-109",
   "metadata": {},
   "source": [
    "## Summary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-110",
   "metadata": {},
   "source": [
    "In this tutorial, we've demonstrated how the Gemini Vision Plugin extends FiftyOne's capabilities with powerful multimodal AI features:\n",
    "\n",
    "**Dataset Analysis:**\n",
    "* Used FiftyOne Brain to identify class imbalances, duplicates, and coverage gaps\n",
    "* Leveraged Gemini Vision to audit annotations and identify missing labels\n",
    "* Classified images by scenario characteristics (weather, time, scene type)\n",
    "\n",
    "**Dataset Enhancement:**\n",
    "* Generated synthetic images for underrepresented classes and scenarios\n",
    "* Edited existing images to create weather and lighting variations\n",
    "* Transferred styles between images to augment dataset diversity\n",
    "\n",
    "**Video Understanding:**\n",
    "* Analyzed temporal patterns in driving videos\n",
    "* Extracted event timestamps and segmented videos by characteristics\n",
    "* Queried video content with natural language\n",
    "\n",
    "By combining FiftyOne's data-centric workflows with Gemini Vision's multimodal AI capabilities, you can build higher-quality, more diverse datasets that lead to more robust computer vision models.\n",
    "\n",
    "For more information:\n",
    "* [Gemini Vision Plugin Documentation](https://docs.voxel51.com/plugins/plugins_ecosystem/gemini_vision_plugin.html)\n",
    "* [FiftyOne Brain Documentation](https://docs.voxel51.com/user_guide/brain.html)\n",
    "* [Google Gemini API Documentation](https://ai.google.dev/gemini-api/docs)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
